An Open Linguistic Infrastructure for Annotated Corpora
Abstract
Annotated corpora are a fundamental resource for research and development in natural language processing (NLP). Although unannotated corpora (for example, Gigaword and Wikipedia) are often used to build language models, annotations for linguistic phenomena provide a richer set of features and hence potentially better models in the long run. It is widely accepted that a first step in the pursuit of NLP applications for any language is to develop a high-quality annotated corpus with at least a basic set of annotations for phenomena such as part of speech and shallow syntax. Meanwhile, corpora for languages such as English, for which substantial annotated resources already exist, are increasingly being enhanced with additional annotations for semantic and discourse phenomena (e.g., semantic roles, sense annotations, coreference, named entities, discourse structure). This is occurring for at least two reasons. First, more and deeper linguistic information, together with the study of intra-level interactions, may lead to insights that can improve NLP applications. Second, in order to handle more subtle and difficult aspects of language understanding, there is a trend away from purely statistical approaches and (back) toward symbolic or rule-based approaches. Richly annotated corpora provide the raw materials for this kind of development. As a result, there is an increased demand for high-quality linguistic annotations of corpora representing a wide range of phenomena, especially at the semantic level, to support machine learning and computational linguistics research in general. At the same time, there is a demand for annotated corpora representing a broad range of genres, because domain affects both syntactic and semantic characteristics. Finally, there is a keen awareness of the need for annotated corpora that are both easily accessible and available for use by anyone.
Related resources
MultiMASC: An Open Linguistic Infrastructure for Language Research
This paper describes MultiMASC, which builds upon the Manually Annotated Sub-Corpus (MASC) (Ide et al., 2008; Ide et al., 2010) project, a community-based collaborative effort to create, annotate, and validate linguistic data and annotations on broad-genre open language data. MultiMASC will extend MASC to include comparable corpora in other languages that not only represent the same genres an...
Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure
Multilingual technologies, which to a large extent are language independent, provide powerful support for building annotated linguistic resources more easily for languages where such resources are scarce or missing. All these technologies require parallel corpora in order to achieve their ends. Parallel texts encode extremely valuable linguistic knowledge because the linguistic decisions made b...
Linguistically Annotated Learner Corpora: Aspects of a Layered Linguistic Encoding and Standardized Representation
Linguistically annotated corpora that are stored in standardized digital form can be a valuable source of empirical insight. They can help verify linguistic generalizations and support the formulation of new hypotheses. The linguistic annotation of such corpora often is crucial for their effective exploration from a linguistic perspective. The annotation essentially serves as an index to the li...
graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora
We present graphANNIS, a fast implementation of the established query language AQL for dealing with deeply annotated linguistic corpora. AQL builds on a graph-based abstraction for modeling and exchanging linguistic data, yet all its current implementations use relational databases as the storage layer. In contrast, graphANNIS directly implements the ANNIS graph data model in main memory. We show th...
Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora
Large and open multiparallel corpora are a valuable resource for contrastive corpus linguists if the data is annotated and stored in a way that allows precise and flexible ad hoc searches. A linguistic query language should also support computational linguists in automated multilingual data mining. We review a broad range of approaches for linguistic query and reporting languages according to u...
Publication date: 2013